Report Requirements: You are required to apply the knowledge acquired during the course – working on the dataset referenced in page 2 – to create a report using RStudio and RMarkdown along with the other packages studied (mainly tidyverse and tidymodels).
Task #1: Choose at least two of the following points to answer: Is there a relationship between a movie budget and its number of votes? What about the rating? What are the genres that have the highest average rating? What are the plot keywords that have the highest average rating? Who are the highly rated directors? Who are the highly rated actors? Who are the most profitable directors? Who are the most profitable actors? Which countries have higher average rating? Which countries produced more movies? How did the number of movie rating audience evolved over the years?
Task #2: Design and implement a predictive model to find the expected movie rating score using the features supplied in the dataset. You are free to choose any of the following: The prediction method (classification, regression, or otherwise). The features needed for prediction. The Model evaluation method or criteria.
Dataset Specifications It is a dataset from the TMDB (The Movies Database) website for ~5000 movie titles separated into two files:
Movie Metadata
=====================================================================================================================
Column Name | Column Description ||====================================================================================================================
movie_id | the movie id |title | the movie title |original_language | the movie language |release_date | the movie releasedate |budget | the movie budget |revenue | the movie revenue |runtime | the movie runtime in minutes |vote_average | the movie TMDB average rating |vote_count | the movie TMDB rating users count |popularity | the movie TMDB popularity score |genres | the movie genres separated by a pipe |keywords | the movie keywords separated by a pipe |production_companies | the movie companies separated by a pipe |production_countries | the movie countries separated by a pipe |=====================================================================================================================
Movie Cast and Crew
=====================================================================================================================
Column Name | Column Description |=====================================================================================================================
movie_id | the movie id |director | the movie director name |producer | the movie producer name |actor_1 | the movie actor_1 name |actor_2 | the movie actor_2 name |actor_3 | the movie actor_3 name |=====================================================================================================================
=====================================================================================================================
Dataset Reference: https://www.kaggle.com/tmdb/tmdb-movie-metadata/ |(Attached with the dataset the script used to transform from the original dataset) |=====================================================================================================================
#Start From Here
library('tidyverse')
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.1 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library('rmarkdown')
library('tidymodels')
## Warning: package 'tidymodels' was built under R version 4.0.5
## -- Attaching packages -------------------------------------- tidymodels 0.1.3 --
## v broom 0.7.6 v rsample 0.0.9
## v dials 0.0.9 v tune 0.1.5
## v infer 0.5.4 v workflows 0.2.2
## v modeldata 0.1.0 v workflowsets 0.0.2
## v parsnip 0.1.5 v yardstick 0.0.8
## v recipes 0.1.16
## Warning: package 'broom' was built under R version 4.0.5
## Warning: package 'dials' was built under R version 4.0.5
## Warning: package 'scales' was built under R version 4.0.5
## Warning: package 'infer' was built under R version 4.0.5
## Warning: package 'modeldata' was built under R version 4.0.5
## Warning: package 'parsnip' was built under R version 4.0.5
## Warning: package 'recipes' was built under R version 4.0.5
## Warning: package 'rsample' was built under R version 4.0.5
## Warning: package 'tune' was built under R version 4.0.5
## Warning: package 'workflows' was built under R version 4.0.5
## Warning: package 'workflowsets' was built under R version 4.0.5
## Warning: package 'yardstick' was built under R version 4.0.5
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
library(readr)
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
##
## precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
##
## lift
library(DAAG)
## Warning: package 'DAAG' was built under R version 4.0.5
library(dplyr)
tmdb_cast_crew <- read_csv("tmdb_cast_crew.csv",
col_types = cols(movie_id = col_character()))
#View(tmdb_cast_crew)
tmdb_movies_metadata <- read_csv("tmdb_movies_metadata.csv",
col_types = cols(movie_id = col_character(),
release_date = col_date(format = "%Y-%m-%d")))
#View(tmdb_movies_metadata)
glimpse(tmdb_movies_metadata)
## Rows: 4,795
## Columns: 14
## $ movie_id <chr> "19995", "285", "206647", "49026", "49529", "559"~
## $ title <chr> "Avatar", "Pirates of the Caribbean: At World's E~
## $ original_language <chr> "en", "en", "en", "en", "en", "en", "en", "en", "~
## $ release_date <date> 2009-12-10, 2007-05-19, 2015-10-26, 2012-07-16, ~
## $ budget <dbl> 2.37e+08, 3.00e+08, 2.45e+08, 2.50e+08, 2.60e+08,~
## $ revenue <dbl> 2787965087, 961000000, 880674609, 1084939099, 284~
## $ runtime <dbl> 162, 169, 148, 165, 132, 139, 100, 141, 153, 151,~
## $ vote_average <dbl> 7.2, 6.9, 6.3, 7.6, 6.1, 5.9, 7.4, 7.3, 7.4, 5.7,~
## $ vote_count <dbl> 11800, 4500, 4466, 9106, 2124, 3576, 3330, 6767, ~
## $ popularity <dbl> 150.43758, 139.08262, 107.37679, 112.31295, 43.92~
## $ genres <chr> "Action|Adventure|Fantasy|Science Fiction", "Adve~
## $ keywords <chr> "culture clash|future|space war|space colony|soci~
## $ production_companies <chr> "Ingenious Film Partners|Twentieth Century Fox Fi~
## $ production_countries <chr> "United States of America|United Kingdom", "Unite~
glimpse(tmdb_cast_crew)
## Rows: 4,803
## Columns: 6
## $ movie_id <chr> "19995", "285", "206647", "49026", "49529", "559", "38757", "~
## $ director <chr> "James Cameron", "Gore Verbinski", "Sam Mendes", "Christopher~
## $ producer <chr> "James Cameron", "Jerry Bruckheimer", "Barbara Broccoli", "Ch~
## $ actor_1 <chr> "Sam Worthington", "Johnny Depp", "Daniel Craig", "Christian ~
## $ actor_2 <chr> "Zoe Saldana", "Orlando Bloom", "Christoph Waltz", "Michael C~
## $ actor_3 <chr> "Sigourney Weaver", "Keira Knightley", "Léa Seydoux", "Gary O~
1- Assessing Data: exploring data and giving comments over all the wired, wrong, and missing values in the data.
2- Cleaning data: clean the data by using the comments we wrote.
tmdb_movies_metadata
tmdb_cast_crew
summary(tmdb_movies_metadata)
## movie_id title original_language release_date
## Length:4795 Length:4795 Length:4795 Min. :1916-09-04
## Class :character Class :character Class :character 1st Qu.:1999-07-03
## Mode :character Mode :character Mode :character Median :2005-09-30
## Mean :2002-12-23
## 3rd Qu.:2011-02-14
## Max. :2017-02-03
## NA's :1
## budget revenue runtime vote_average
## Min. : 0 Min. :0.000e+00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 800000 1st Qu.:0.000e+00 1st Qu.: 94.0 1st Qu.: 5.600
## Median : 15000000 Median :1.926e+07 Median :104.0 Median : 6.200
## Mean : 29092674 Mean :8.240e+07 Mean :106.9 Mean : 6.093
## 3rd Qu.: 40000000 3rd Qu.:9.312e+07 3rd Qu.:118.0 3rd Qu.: 6.800
## Max. :380000000 Max. :2.788e+09 Max. :338.0 Max. :10.000
## NA's :2
## vote_count popularity genres keywords
## Min. : 0.0 Min. : 0.000 Length:4795 Length:4795
## 1st Qu.: 54.0 1st Qu.: 4.724 Class :character Class :character
## Median : 236.0 Median : 12.963 Mode :character Mode :character
## Mean : 691.4 Mean : 21.527
## 3rd Qu.: 738.0 3rd Qu.: 28.352
## Max. :13752.0 Max. :875.581
##
## production_companies production_countries
## Length:4795 Length:4795
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
summary(tmdb_cast_crew)
## movie_id director producer actor_1
## Length:4803 Length:4803 Length:4803 Length:4803
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## actor_2 actor_3
## Length:4803 Length:4803
## Class :character Class :character
## Mode :character Mode :character
sum(is.na(tmdb_cast_crew))
## [1] 1212
comments:
there is 1212 NA values out of 4,803 rows in , tmdb_cast_crew.
colSums(is.na(tmdb_cast_crew))
## movie_id director producer actor_1 actor_2 actor_3
## 0 30 1023 43 53 63
comments: imputing the missing values here will not be the optimal thing, we can remove the producer column then removing the remaining na values.
sum(is.na(tmdb_movies_metadata))
## [1] 956
comments:
there is 956 NA values out of 4,795 rows in , tmdb_movies_metadata. we can impute the missing values
colSums(is.na(tmdb_movies_metadata))
## movie_id title original_language
## 0 0 0
## release_date budget revenue
## 1 0 0
## runtime vote_average vote_count
## 2 0 0
## popularity genres keywords
## 0 28 409
## production_companies production_countries
## 345 171
comments: it seems that there is a three columns that have a lot of nan values like : keywords, production_companies, and production_countries. we can remove them simply and these columns are not important that much. then we can remoove the na values.
tmdb_cast_crew tmdb_movies_metadata
sum(duplicated(tmdb_cast_crew))
## [1] 0
comments:
no duplicates.
sum(duplicated(tmdb_movies_metadata))
## [1] 0
comments:
no duplicates.
#length(unique(df))
apply(tmdb_cast_crew, 2, function(x) length(unique(x)))
## movie_id director producer actor_1 actor_2 actor_3
## 4803 2352 1767 2096 2721 3096
#length(unique(df))
apply(tmdb_movies_metadata, 2, function(x) length(unique(x)))
## movie_id title original_language
## 4795 4792 37
## release_date budget revenue
## 3279 434 3296
## runtime vote_average vote_count
## 157 71 1609
## popularity genres keywords
## 4794 1175 4218
## production_companies production_countries
## 3695 469
data i will clean :
1-remove columns: keywords, production_companies, and production_countries from tmdb_movies_metadata producer from tmdb_cast_crew
2-remove na values
metadata = subset(tmdb_movies_metadata, select = -c(keywords,production_companies,production_countries))
crew = subset(tmdb_cast_crew, select = -c(producer))
metadata <- na.omit(metadata)
crew <- na.omit(crew)
colSums(is.na(metadata))
## movie_id title original_language release_date
## 0 0 0 0
## budget revenue runtime vote_average
## 0 0 0 0
## vote_count popularity genres
## 0 0 0
colSums(is.na(crew))
## movie_id director actor_1 actor_2 actor_3
## 0 0 0 0 0
metadata
boxplot( metadata$budget ,
main = "budget",
at = c(1),
names = c( "budget"),
las = 2,
#col = c("orange","red","green"),
border = "brown",
horizontal = TRUE,
notch = FALSE
)
boxplot( metadata$revenue ,
main = "revenue",
at = c(1),
names = c( "revenue"),
las = 2,
#col = c("orange","red","green"),
border = "brown",
horizontal = TRUE,
notch = FALSE
)
boxplot( metadata$runtime ,
main = "runtime",
at = c(1),
names = c( "runtime"),
las = 2,
#col = c("orange","red","green"),
border = "brown",
horizontal = TRUE,
notch = FALSE
)
boxplot(metadata$vote_average ,
main = "vote_average",
at = c(1),
names = c("vote_average"),
las = 2,
#col = c("orange","red","green"),
border = "brown",
horizontal = TRUE,
notch = FALSE
)
boxplot(metadata$vote_count ,
main = "vote_count",
at = c(1),
names = c("vote_count"),
las = 2,
#col = c("orange","red","green"),
border = "brown",
horizontal = TRUE,
notch = FALSE
)
boxplot( metadata$popularity ,
main = "popularity",
at = c(1),
names = c( "popularity"),
las = 2,
#col = c("orange","red","green"),
border = "brown",
horizontal = TRUE,
notch = FALSE
)
boxplot just shows that these column has outlaiers but we will keep them.
metadata
What are the genres that have the highest average rating?
metadata %>%
group_by (genres) %>%
summarise(max_rate = max(vote_average)) %>%
filter(max(max_rate) == max_rate)
Who are the highly rated directors? Who are the highly rated actors?
metadata %>% inner_join(crew) -> df
## Joining, by = "movie_id"
df %>%
group_by (director) %>%
summarise(max_rated_director = max(vote_average)) %>%
filter(max(max_rated_director) == max_rated_director) %>%
select (director)
df %>%
group_by (actor_1) %>%
summarise(max_rated_actor_1 = max(vote_average)) %>%
filter(max(max_rated_actor_1) == max_rated_actor_1) %>%
select (actor_1)
df %>%
group_by (actor_2) %>%
summarise(max_rated_actor_2 = max(vote_average)) %>%
filter(max(max_rated_actor_2) == max_rated_actor_2) %>%
select (actor_2)
df %>%
group_by (actor_3) %>%
summarise(max_rated_actor_3 = max(vote_average)) %>%
filter(max(max_rated_actor_3) == max_rated_actor_3) %>%
select (actor_3)
df %>%
select(budget,revenue, runtime , vote_average, vote_count, popularity) %>%
mutate(budget = (budget - mean(budget))/sd(budget)) %>%
mutate(revenue = (revenue - mean(revenue))/sd(revenue)) %>%
mutate(runtime = (runtime - mean(runtime))/sd(runtime)) %>%
mutate(vote_average = (vote_average - mean(vote_average))/sd(vote_average)) %>%
mutate(vote_count = (vote_count - mean(vote_count))/sd(vote_count)) %>%
mutate(popularity = (popularity - mean(popularity))/sd(popularity)) -> df_scaled
df_scaled
ggplot(df_scaled,
aes(x = vote_average,
y = budget)) +
geom_point()
as much the budget increases the vote average increases , there is a weak linear colinearity between the two features
ggplot(df_scaled,
aes(x = vote_average,
y = revenue)) +
geom_point()
as much the revenue increases the vote average increases , there is a weak linear colinearity between the two features
ggplot(df_scaled,
aes(x = vote_average,
y = runtime
)) +
geom_point()
as much the runtime increases the vote average increases , there is a weak linear colinearity between the two features
ggplot(df_scaled,
aes(x = vote_average,
y = vote_count)) +
geom_point()
as much the vote count increases the vote average increases , there is a weak linear colinearity between the two features
ggplot(df_scaled,
aes(x = vote_average,
y = popularity)) +
geom_point()
as much the popularity increases the vote average increases , there is a weak linear colinearity between the two features
#lets see the correlation between the independent variables
cor(df_scaled[,names(df_scaled)!="vote_average"])
## budget revenue runtime vote_count popularity
## budget 1.0000000 0.7293913 0.2648640 0.5904653 0.5013138
## revenue 0.7293913 1.0000000 0.2504608 0.7804131 0.6427591
## runtime 0.2648640 0.2504608 1.0000000 0.2710457 0.2189417
## vote_count 0.5904653 0.7804131 0.2710457 1.0000000 0.7767158
## popularity 0.5013138 0.6427591 0.2189417 0.7767158 1.0000000
Predictors are highly independent, that’s good!
df_scaled %>%
select(budget,revenue, runtime ,vote_count ,popularity, vote_average) ->df_scaled
below splits the df_scaled data set so that 80% is used for training a linear regression model and 20% is used to evaluate the model performance.
# Split the data into training and test set
set.seed(123)
training.samples <- df_scaled$revenue %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- df_scaled[training.samples, ]
test.data <- df_scaled[-training.samples, ]
# Build the model
model <- lm(vote_average ~., data = train.data)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
#Evaluate Model Using R2, EMSE , MAE
data.frame( R2 = R2(predictions, test.data$vote_average),
RMSE = RMSE(predictions, test.data$vote_average),
MAE = MAE(predictions, test.data$vote_average))
Comment: we can see that the model don’t fitting the data very well as the r2 is very low by 18%
the prediction error rate, which should be as small as possible
RMSE(predictions, test.data$vote_average)/mean(test.data$vote_average)
## [1] 36.13323
it is high!
#K-fold cross-validation
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(vote_average ~., data = df_scaled, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 4715 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4243, 4245, 4244, 4242, 4243, 4243, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.8848452 0.2180938 0.6040948
##
## Tuning parameter 'intercept' was held constant at a value of TRUE